STAM101 :: Lecture 12 :: Correlation – definition – Scatter diagram -Pearson’s correlation co-efficient – properties of correlation coefficient
Correlation
Correlation is the study of relationship between two or more variables. Whenever we conduct any experiment we gather information on more related variables. When there are two related variables their joint distribution is known as bivariate normal distribution and if there are more than two variables their joint distribution is known as multivariate normal distribution.
In case of bi-variate or multivariate normal distribution, we are interested in discovering and measuring the magnitude and direction of relationship between 2 or more variables. For this we use the tool known as correlation.
Suppose we have two continuous variables X and Y and if the change in X affects Y, the variables are said to be correlated. In other words, the systematic relationship between the variables is termed as correlation. When only 2 variables are involved the correlation is known as simple correlation and when more than 2 variables are involved the correlation is known as multiple correlation. When the variables move in the same direction, these variables are said to be correlated positively and if they move in the opposite direction they are said to be negatively correlated.
Scatter Diagram
To investigate whether there is any relation between the variables X and Y we use scatter diagram. Let (x1,y1), (x2,y2)….(xn,yn) be n pairs of observations. If the variables X and Y are plotted along the X-axis and Y-axis respectively in the x-y plane of a graph sheet the resultant diagram of dots is known as scatter diagram. From the scatter diagram we can say whether there is any correlation between x and y and whether it is positive or negative or the correlation is linear or curvilinear.
Positive Correlation Negative correlation
Curvilinear no correlation
(or) non linear
Pearsons Correlation coefficient
The measures of the degree of relationship between two continuous variables is called correlation coefficient. It is denoted by r.( in case of sample )and r (in case of population). The correlation coefficient r is known as Pearson’s correlation coefficient as it was discovered by Karl Pearson. It is also called as product moment correlation.
The correlation coefficient r is given as the ratio of covariance of the variables X and Y to the product of the standard deviation of X and Y.
Symbolically,
which can be simplified as
This correlation coefficient r is known as Pearson’s Correlation coefficient. The numerator is termed as sum of product of X and Y and abbreviated as SP(XY). In the denominator the first term is called sum off squares of X (i.e) SS(X) and second term is called sum of squares of Y (i.e) SS(Y)
\
The denominator in the above formula is always positive. The numerator may be positive or negative making r to be either positive or negative.
Assumptions in correlation analysis:
Correlation coefficient r is used under certain assumptions, they are
- The variables under study are continuous random variables and they are normally distributed
- The relationship between the variables is linear
- Each pair of observations is unconnected with other pair (independent)
Properties
- The correlation coefficient value ranges between –1 and +1.
- The correlation coefficient is not affected by change of origin or scale or both.
- If r > 0 it denotes positive correlation
r< 0 it denotes negative correlation between the two variables x and y.
r = 0 then the two variables x and y are not linearly correlated.(i.e)two
variables are independent.
r = +1 then the correlation is perfect positive
r = -1 then the correlation is perfect negative.
Testing the significance of r
The significance of r can be tested by Student’s t test. The test statistics is given by
This t is distributed as Student’s t distribution with (n-2) degrees of freedom.
The relationship between the variables is interpreted by the square of the correlation coefficient (r2) which is called coefficient of determination. The value 1-r2 is called as coefficient of alienation. If r2 is 0.72, it implies that on the basis of the samples 72% of the variation in one variable is caused by the variation of the other variable. The coefficient of determination is used to compare 2 correlation coefficients.
Problem
Compute Pearsons coefficient of correlation between plant height (cm) and yield (Kgs) as per the data given below:
Plant Height (cm) |
39 |
65 |
62 |
90 |
82 |
75 |
25 |
98 |
36 |
78 |
Yield in Kgs |
47 |
53 |
58 |
86 |
62 |
68 |
60 |
91 |
51 |
84 |
Solution
Ho: The correlation coefficient r is not significant
H1: The correlation coefficient r is significant.
Level of significance 5%
From the data
n = 10
Correlation coefficient is positively correlated.
Test Statistic
ttab=t(10-2, 5%los)=2.306
Inference
t> ttab, we reject null hypothesis.
\The correlation coefficient r is significant. (i.e) there is a relation between plant height and yield.
Download this lecture as PDF here |